Abstract. Web Intelligence (WI) is growing rapidly. The notion of wisdom,
which is considered the next paradigm shift of WI, has become a hot research
topic in recent years. A basic application of wisdom is holding a short
conversation in an interactive and understandable way based on the huge
resources of the web. However, current conversation systems normally rely on
recognizing semantic similarities against a prepared database, neglecting the
true intention hidden in the expression. In this paper, we present a model
based on a medical Q&A knowledge base to overcome this challenge. The
knowledge base includes three parts: disease entities, medicine entities and
symptom properties. A simple graph path algorithm based on word direction and
relation weight adjustment is used to realize conversation intention
perception. The experimental results show that this method can effectively
perceive types of intention. The method can also be applied to deep
understanding in other intelligent systems, such as classification and text
mining.
1  Introduction

Web Intelligence (WI) is a new direction of academic research and industry
development. The main task of WI is to make use of web information and
knowledge in a professional and effective way, based on technologies such as
knowledge discovery, data mining, intelligent agents and advanced information
technology [1]. Within WI, the notion of wisdom [2] is gaining much attention
in scientific research. In a simple practical form, the concept of wisdom
contributes to conversation systems in which a person and a computer interact
in an unobstructed and easy manner, just like communication between human
beings. This application needs to grasp the real intention of a person's
sentences accurately and comprehensively, that is, to understand what the
person's true demand is.

A traditional conversation system measures the semantic similarity between
human inputs [3]. This is not trivial to realize, and its performance is
unsatisfactory when the inputs have little word overlap. In addition, the
implication of a conversation depends on how keywords are combined across
sentences. For example, in a health care consultation, a patient may describe
symptoms or personal information in a fairly long sentence and only at the end
name a medicine, wanting to know whether it helps cure the disease or whether
it may cause side effects given his current condition. A traditional
conversation system would decide from the symptoms that the patient is talking
about a disease and would recommend the most common treatment. The problem is
that such systems cannot identify the customer's real intention from the given
information.

The most popular online medical answering or guiding systems in China rely
mainly on manual consultation. Health care systems and modern health
infrastructure have played an essential role in recent years [4]. However,
self-management for health care faces two challenges: a) automatically
building a health knowledge base with comprehensive disease and medicine
information. In recent years, more and more knowledge bases with massive data
have been built, such as Wikipedia^1, WordNet^2 and Baike^3, but most of them
are established by manual editing; b) developing an intelligent consulting
system that can detect the customer's intention and provide treatment
recommendations within a short conversation. Applications of knowledge bases
for WI are still very few.

In this paper, we use massive health care Q&A data to build a health knowledge
base and develop a conversation intention perception system for Chinese. The
knowledge base includes three parts: disease entities, medicine entities and
symptom properties. The association links between them are created according
to a simple graph path algorithm. We use a content center detection algorithm
based on the knowledge graph to estimate the conversation intention. The
experimental results show that this method can effectively perceive
requirement types. The method can also be applied to deep understanding in
other intelligent systems, such as classification and text mining. The main
contributions of this paper are as follows:

•  Based on medical entities, we extract disease entities, medicine entities
and symptom entities from online resources using keyword extraction and
feature selection methods.

•  According to the association relations between keywords in a sentence, we
propose an automatic approach to building the knowledge base. We extend the
association relations between entity nodes in the built knowledge base to
construct the relation map and the weights between nodes.

•  According to the knowledge graph path and relation weights, we identify the
conversation intention within a short conversation.

The rest of this paper is organized as follows. Section 2 discusses the most
related work, including state-of-the-art approaches to intention perception
and traditional conversation systems. We describe the data collection and
knowledge base construction in Section 3. The intention perception algorithm
is explained in Section 4. Section 5 presents the experimental performance and
evaluates several factors that may affect it. We summarize the paper and
discuss future work in Section 6.

^1 http://www.wikipedia.org/
^2 http://wordnet.princeton.edu/
^3 http://www.baike.com/


2  Related  Work 

Conversation systems aim at finding similar context in an existing data set.
Earlier works mainly use semantic similarity measures between sentences, such
as the overlap coefficient, Dice coefficient and Jaccard coefficient, to get
the desired result [5]. Because these methods work poorly when there is little
word overlap between queries, later research has made considerable progress
using statistical techniques from information retrieval. Jeon et al. [6] study
automatic methods of finding semantically similar question pairs based on the
assumption that questions with similar answers are themselves similar. Ko et
al. [7] apply answer relevance and answer similarity in a statistical model,
and later improve this model by considering the correlation of the correctness
of answer candidates [8]. These systems mainly rely on the semantic similarity
of human inputs and neglect the user's intention implied in them.
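As a brief illustration of the three word-overlap measures mentioned above, the following sketch computes them on token sets. The whitespace tokenizer and the two example sentences are assumptions made for illustration only; they are not taken from the cited systems.

```python
def tokens(sentence):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return set(sentence.lower().split())


def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def dice_coefficient(a, b):
    """2 |A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard_coefficient(a, b):
    """|A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


q1 = tokens("my stomach hurts after eating")
q2 = tokens("stomach pain after meals")
print(overlap_coefficient(q1, q2), dice_coefficient(q1, q2), jaccard_coefficient(q1, q2))
```

The two questions plainly concern the same complaint, yet they share only two tokens, which is the low-overlap situation these measures handle poorly.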

Intention perception is the key technology for a conversation system, since a
machine that understands the user performs well when returning the answer [9].
It is a difficult task given the variety of human actions, and most research
on it is in the field of human-robot interaction [10]. One of the main
obstacles is that recognizing the user's intention involves uncertainties;
Jeon et al. [11] propose an ontology-based approach to minimize them. Other
research applies machine learning to the problem. Kuan et al. [12] use a
Support Vector Machine (SVM) and linear regression as two steps to identify
human intention. Hofmann et al. [13] adopt Bayesian belief networks to form
the intention model. These methods have been shown to be effective in their
domains.

Although context semantic and machine learning approaches perform well in
simple dialogue, it is still very difficult for them to deal with newly
growing knowledge. Moreover, a context-based conversation intention approach
cannot associate the current knowledge with linked or similar knowledge as
humans do. Therefore, we use a knowledge base as the fundamental element for
conversation intention perception.


3  Knowledge  Base  Building 

Several large, general-purpose knowledge bases, such as Wikipedia,
MozillaZine^4, Probase [14], GeoNames^5 and WordNet [15], have been set up
manually or semiautomatically. Here we use a massive Q&A data set^6 to build a
content-based knowledge base. We use a distributed web crawler to download the
target web pages and tools such as DOMTree to translate the gained Q&A
information into XML format.

^4 http://kb.mozillazine.org/Knowledge_Base
^5 http://www.geonames.org/

3.1  Q&A  Archive 


Field            Content
URL              http://120ask.com/question/34672281.html
Question Title   Body itch
Question Body    When the summer comes, my body itch and exists red dot...
Requirement      What disease it is
Answers          It may relate to allergy which is caused by summer insects...

Table 1. Structure of a question and answer pair


The Q&A archive we collected is organized as shown in Table 1. All Q&A pairs
in our experiment are in Chinese; in this paper, we translate the Chinese
words into English to make our examples clearer. Each item in the archive has
five fields. The URL is the unique identifier of the question and answer pair.
The Question Title is a short description of the question, and the Question
Body gives a detailed statement of the question; it is the basic data for our
experiment. The average length of the question body is 48 words in Chinese.
The Requirement field represents the kind of help the questioner is looking
for, so it provides the standard for classifying the question's intention. The
last field, Answers, is the answer selected from several candidates for the
corresponding question; its average length is 108 words in Chinese.
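To make the five-field record concrete, here is a minimal sketch of how one Q&A item could be serialized to XML and read back with Python's standard library. The element names (`qa_pair`, `question_title`, and so on) are hypothetical; the paper does not specify its XML schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names mirroring the five fields of Table 1.
FIELDS = ("url", "question_title", "question_body", "requirement", "answers")


def qa_to_xml(record):
    """Serialize one Q&A record (a dict keyed by FIELDS) to an XML string."""
    item = ET.Element("qa_pair")
    for field in FIELDS:
        child = ET.SubElement(item, field)
        child.text = record.get(field, "")
    return ET.tostring(item, encoding="unicode")


def xml_to_qa(xml_text):
    """Parse the XML produced by qa_to_xml back into a dict."""
    item = ET.fromstring(xml_text)
    return {field: (item.findtext(field) or "") for field in FIELDS}


example = {
    "url": "http://120ask.com/question/34672281.html",
    "question_title": "Body itch",
    "question_body": "When the summer comes, my body itch and exists red dot...",
    "requirement": "What disease it is",
    "answers": "It may relate to allergy which is caused by summer insects...",
}
xml_text = qa_to_xml(example)
```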

We collected 30 million Q&A pairs from the web and divided them into two
collections: 25 million pairs for training and 5 million for testing. We use
the Requirement field to label both the training data and the testing data.
Phrases such as "what disease", "how to cure", "what medicine" or "negative
influence" in the Requirement field are used to assign the question to its
corresponding category. As not everyone fills in the Requirement field, we
finally obtain nearly 1 million labeled training items and 200 thousand
labeled testing items for our experiment. In preprocessing, we remove
redundant tokens from the questions, such as stop words, digits and links.
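The labeling and preprocessing steps just described can be sketched as follows. The English cue phrases stand in for the Chinese ones, and the stop-word list, the regular expressions and the function names are illustrative assumptions rather than the rules actually used.

```python
import re

# English stand-ins for the Chinese cue phrases, one tuple per intention type.
INTENTION_CUES = {
    1: ("what disease",),
    2: ("how to cure",),
    3: ("what medicine",),
    4: ("negative influence",),
}

# Tiny illustrative stop-word list; a real system would use a full one.
STOP_WORDS = {"the", "a", "is", "it", "my", "and"}


def label_by_requirement(requirement):
    """Return the intention type whose cue phrase occurs, or None if unlabeled."""
    text = requirement.lower()
    for label, cues in INTENTION_CUES.items():
        if any(cue in text for cue in cues):
            return label
    return None


def preprocess(question):
    """Drop links, digits and stop words; return the remaining words."""
    no_links = re.sub(r"https?://\S+", " ", question.lower())
    no_digits = re.sub(r"\d+", " ", no_links)
    words = re.findall(r"[^\W\d_]+", no_digits)
    return [w for w in words if w not in STOP_WORDS]


print(label_by_requirement("What disease it is"))
print(preprocess("My body itch 3 days see http://example.com/x"))
```

Items whose Requirement field matches no cue return `None` and would simply be left out of the labeled sets.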

3.2  Knowledge  Base 

To build the knowledge base for our experiment, we need to collect three types
of medical entities: disease entities, medicine entities and symptom entities.
The relations between them are then formed. The established knowledge base is
the foundation for the next stage of the experiment.


^6 http://www.120ask.com
•  Disease and Medicine Entities: These two kinds of entities are professional
terms and are obtained by web crawling. Baidu Encyclopedia (BE)^7 is an
open-content online encyclopedia in Chinese that covers all areas of
knowledge. Since BE pages are well tagged, a maximum entropy classifier is
adopted to classify the entries into a large number of categories using
structural information. After labeling, we get nearly 25000 disease entities
and 9800 medicine names.

•  Symptom Entity: Since most symptom entities are not professional terms and
occur in colloquial descriptions, they cannot be easily and accurately
discovered from professional encyclopedia web sites. We extract symptom
entities from the collected question and answer pairs, based on the assumption
that most symptoms appear many times because patients usually have a limited
vocabulary to describe their diseases. Thus we extract the phrases that occur
frequently in question and answer pairs and then combine the phrases with
positive and negative adverbs. After manual selection we get nearly 3000
symptom entities.

•  Relation Map: For the knowledge base, several relationships are identified:
diseases have corresponding symptoms, diseases can be cured by corresponding
medicines, and symptoms can be cured by corresponding medicines. The Q&A pairs
are used because the relationships between entities are hidden in them. We
assume that the more frequently two entities appear together in a Q&A pair,
the more likely they are connected, and the higher the frequency, the closer
their relationship. After a filtering process, we build the relation map among
these entities.
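The co-occurrence idea behind the relation map can be sketched as below. The toy entity dictionary, the placeholder medicine name `drug_x`, the whitespace tokenization and the `min_count` filter threshold are all assumptions for illustration; the paper does not state its filtering rule.

```python
from collections import Counter
from itertools import combinations

# Toy entity dictionary: entity name -> category (hypothetical entries).
ENTITY_TYPE = {
    "gastritis": "disease",
    "vomit": "symptom",
    "pain": "symptom",
    "drug_x": "medicine",
}


def build_relation_map(qa_texts, entity_type, min_count=2):
    """Count cross-category entity pairs that co-occur in the same Q&A text.

    Pairs seen fewer than min_count times are filtered out (assumed rule).
    """
    counts = Counter()
    for text in qa_texts:
        found = sorted({w for w in text.lower().split() if w in entity_type})
        for a, b in combinations(found, 2):
            # Only disease-symptom, disease-medicine and symptom-medicine links.
            if entity_type[a] != entity_type[b]:
                counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


qa_texts = ["gastritis vomit pain", "gastritis vomit", "gastritis drug_x", "pain vomit"]
relation_map = build_relation_map(qa_texts, ENTITY_TYPE)
print(relation_map)
```

Storing each pair in sorted order keeps the map undirected, so a link is counted once regardless of which entity appears first in the text.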

4  Intention  Perception 

The basic idea of our model is to use the medical knowledge base and relation
map to adjust the keyword weights for the different intention categories,
based on correlation strength and graph path. We assume that an entity that
receives more connections from other entities is more important in the
conversation. Therefore, the more entities are connected to the current
phrase, the more weight is added to it.

The intention perception problem is essentially a dynamic classification
problem. We divide medical questions into four types of intention:

•  the asker wants to know what disease it may be

•  the asker wants to know how to cure the disease or the described symptom

•  the asker wants to know which medicine cures the disease or the symptom

•  the asker wants to know the negative influence of the mentioned medicine

4.1  Weight  Adjustment 

^7 http://baike.baidu.com/

Fig. 1. The relation map of entities.

The relation map of entities based on the knowledge base established above is
shown in Fig. 1. A double-ended arrow indicates that two entities are
connected directly, and the number on it is their co-occurrence frequency.
First, we compute the distance between two entities along the shortest path
connecting them. For example, the distance between the entities Gastritis and
Vomit is one because they are connected directly; the distance between
Gastritis and Fracture is two because they are connected through the entity
Pain; and the distance between Gastritis and Bleeding is three. The distance
between two entities is limited to four. Second, when a query is given, we use
the Jieba word segmenter^8 to split the question into phrases. The initial
weight of each phrase is set to one. Third, for a given entity, the directly
and indirectly connected entities contribute to its weight value; we call this
the contribution value. A closer distance and a larger co-occurrence frequency
between two entities both lead to a larger contribution value. The
contribution value is computed as follows:

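$$\left(1 - \frac{1}{\log F_e}\right) \cdot Y_X, \qquad \langle X, Y_X \rangle \in R,\quad R = \{\langle 1,a \rangle, \langle 2,b \rangle, \langle 3,c \rangle, \langle 4,d \rangle\} \tag{1}$$

where $F_e$ is the co-occurrence frequency of the two entities and $Y_X$ is the
initial contribution value for the corresponding distance value $X$.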
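The distance computation and Eq. (1) can be sketched together as follows, under stated assumptions. The edge frequencies are invented, the logarithm is taken to base 10, and negative factors (which arise when the frequency is below the log base) are clamped to zero; none of these choices is specified in the text. The initial contribution values are the [a, b, c, d] = [0.75, 0.5, 0.25, 0] reported in Section 5.3.

```python
import math
from collections import deque

# Toy relation map: undirected edges with made-up co-occurrence frequencies.
EDGES = {
    ("gastritis", "vomit"): 120,
    ("gastritis", "pain"): 300,
    ("pain", "fracture"): 80,
    ("fracture", "bleeding"): 50,
}
# Initial contribution value Y_X for each distance X (a, b, c, d).
INITIAL_CONTRIBUTION = {1: 0.75, 2: 0.5, 3: 0.25, 4: 0.0}


def build_adjacency(edges):
    """Turn the undirected edge list into an adjacency map."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def distance(adj, source, target, limit=4):
    """Breadth-first shortest-path length, or None if it exceeds the limit."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == limit:
            continue
        for nxt in adj.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def contribution_value(dist, cooccurrence):
    """(1 - 1/log F_e) * Y_X, with base-10 log and clamping as assumptions."""
    if dist not in INITIAL_CONTRIBUTION or cooccurrence <= 1:
        return 0.0
    factor = 1.0 - 1.0 / math.log10(cooccurrence)
    return max(0.0, factor) * INITIAL_CONTRIBUTION[dist]


adj = build_adjacency(EDGES)
print(distance(adj, "gastritis", "bleeding"), contribution_value(1, 100))
```

The toy graph reproduces the distances of the running example: Gastritis to Vomit is one, to Fracture two, and to Bleeding three.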

Each kind of entity stands for a different aspect of the given conversation.
In a medical system, for example, although a symptom entity and a medicine
entity may both connect directly to a disease entity, their influence on the
weight of the disease entity is not the same. We call this influence the
contribution multiplier. Fig. 2 shows the contribution multipliers set in our
model. For instance, if the contribution multiplier of medicine to disease is
α, then the contribution multiplier of disease to medicine is
$\sqrt{1 - \alpha^2}$.

^8 https://pypi.python.org/pypi/jieba/

Fig. 2. The contribution multiplier between connected entities.


When two entities are from the same category, their distance is at least two,
as they cannot be connected to each other directly. Their contribution
multiplier is as follows:

$$\mathrm{contribution\ multiple}(A_1, A_m) = \prod_{i=1}^{m-1} \mathrm{contribution\ multiple}(A_i, A_{i+1}) \tag{2}$$

where $A_1$ and $A_m$ stand for the two entities from the same category, which
are connected by the path from $A_2$ to $A_{m-1}$.
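A minimal sketch of the multiplier along a path is given below, assuming that the per-edge multipliers are combined by multiplication. Only the pairing of α with the medicine-to-disease direction is stated in the text; the assignment of β and γ to the other two category pairs is a guess made for illustration, and the numeric values are the best settings reported in Section 5.3.

```python
import math

# Forward multipliers per ordered category pair. Only medicine -> disease
# (alpha) is given in the text; the beta and gamma pairings are assumed.
FORWARD = {
    ("medicine", "disease"): 0.7,  # alpha
    ("symptom", "disease"): 0.9,   # beta (assumed pairing)
    ("medicine", "symptom"): 0.9,  # gamma (assumed pairing)
}


def pair_multiplier(src_type, dst_type):
    """Multiplier for one directed edge; the reverse edge uses sqrt(1 - m^2)."""
    if (src_type, dst_type) in FORWARD:
        return FORWARD[(src_type, dst_type)]
    if (dst_type, src_type) in FORWARD:
        return math.sqrt(1.0 - FORWARD[(dst_type, src_type)] ** 2)
    raise ValueError("same-category entities are never linked directly")


def path_multiplier(path_types):
    """Combine the edge multipliers along a path A_1 ... A_m by multiplication."""
    result = 1.0
    for src, dst in zip(path_types, path_types[1:]):
        result *= pair_multiplier(src, dst)
    return result


print(path_multiplier(["disease", "symptom", "disease"]))
```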

Combining the contribution value and the contribution multiple, we compute the
weight of an entity as follows:

$$\mathrm{weight}(w_i) = \mathrm{initial\ weight} + \sum_{j=1}^{n} \mathrm{contribution\ value}_j \times \mathrm{contribution\ multiple}_j \tag{3}$$
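Read literally, Eq. (3) adds to the initial weight one product per contributing entity. The sketch below assumes the sum runs over the entities connected to the phrase; the function name and the example numbers are illustrative.

```python
def entity_weight(contributions, initial_weight=1.0):
    """Initial weight plus the sum of value * multiple over connected entities.

    contributions: iterable of (contribution_value, contribution_multiple).
    """
    return initial_weight + sum(value * multiple for value, multiple in contributions)


# Two connected entities with illustrative values and multipliers.
print(entity_weight([(0.375, 0.7), (0.25, 0.9)]))
```

A phrase with no connected entities keeps its initial weight of one.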


4.2  Language  Model 

A Language Model (LM) can be either probabilistic or non-probabilistic.
Probabilistic language models are widely used in data mining and natural
language processing. In this paper, we adopt a probabilistic model for the
classification task. First, we estimate the probability that the sequence of
words relates to each category. Then we rank the probability values and assign
the question to the category with the highest probability.

We use c1, c2, c3, c4 to denote the four types of intention. To decide which
category a given question belongs to, we need the question likelihood
$\Pr(q \mid c)$, computed as follows:

$$\Pr(q \mid c) = \sum_{i=1}^{N} \Pr(w_i \mid c) \tag{4}$$

where $N$ is the number of words in the query and $\Pr(w_i \mid c)$ is the
probability that word $w_i$ occurs in the current category $c$. The
formulation follows a multinomial distribution, which indicates that each
phrase in the question is generated independently and that all phrases obey
the same probability distribution. To compute $\Pr(w_i \mid c)$, we assemble
all the questions of the same type into one synthetic document. Then the
maximum likelihood estimate (MLE) is adopted, which computes the probability
as follows:

$$\Pr(w_i \mid c) = \frac{F_{ic}}{F_c} \tag{5}$$

In this formulation, $F_{ic}$ is the number of training data items of category
$c$ in which $w_i$ occurs, and $F_c$ is the size of the training data of the
current category.
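The estimate in Eq. (5) can be sketched as a simple counting pass. The sketch assumes that the size of a category means its number of training items and that a word is counted once per item; the toy data is invented.

```python
from collections import Counter, defaultdict


def estimate_word_probabilities(training_items):
    """Return {category: {word: F_ic / F_c}} from (words, category) pairs."""
    category_sizes = Counter()
    item_counts = defaultdict(Counter)
    for words, category in training_items:
        category_sizes[category] += 1
        for word in set(words):  # count each word once per item
            item_counts[category][word] += 1
    return {
        c: {w: n / category_sizes[c] for w, n in counts.items()}
        for c, counts in item_counts.items()
    }


training = [
    (["itch", "red"], "c1"),
    (["itch"], "c1"),
    (["cure", "itch"], "c2"),
]
probs = estimate_word_probabilities(training)
print(probs)
```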

The probabilities of a word need to be normalized so that they sum to one over
all the categories. Let $S_w = \sum_{j=1}^{4} \Pr(w \mid c_j)$ be the
normalization factor; then we recalculate the probability of the word by the
following rule:

$$\Pr(w \mid c_i) \leftarrow \frac{\Pr(w \mid c_i)}{\sum_{j=1}^{4} \Pr(w \mid c_j)} \tag{6}$$
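The normalization step can be sketched as follows. Treating a word that never occurs in a category as having probability zero there is an assumption; the paper does not mention smoothing.

```python
def normalize_over_categories(probs):
    """Rescale each word's probabilities so they sum to one across categories."""
    vocabulary = {w for table in probs.values() for w in table}
    normalized = {c: {} for c in probs}
    for w in vocabulary:
        total = sum(table.get(w, 0.0) for table in probs.values())
        if total == 0.0:
            continue  # word has no probability mass anywhere; leave it out
        for c, table in probs.items():
            normalized[c][w] = table.get(w, 0.0) / total
    return normalized


raw = {"c1": {"itch": 1.0, "red": 0.5}, "c2": {"itch": 1.0, "cure": 1.0}}
norm = normalize_over_categories(raw)
print(norm)
```

After this step a word that is equally common in every category carries no preference for any of them, while a word seen in only one category points fully to it.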


4.3  Combined  Model 

Our model combines the weight assigned to each phrase in the given
conversation with the probabilistic language model. The weight represents the
importance of each attribute to the classification. In other words, a word
with a higher weight contributes more than others to the probability
estimation in the classification [16]. The formulation of our model is as
follows:

$$\Pr(q \mid c) = \frac{\sum_{i=1}^{N} \left(\mathrm{weight}(w_i) \cdot \Pr(w_i \mid c)\right)}{\sum_{i=1}^{N} \mathrm{weight}(w_i)} \tag{7}$$
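The combined score is a weighted average of the per-word probabilities, which the sketch below implements. The default weight of one for words without an adjusted weight, the zero probability for unseen words, and the toy probabilities with the placeholder entity `medicine_x` are assumptions for illustration.

```python
def question_score(words, weights, word_probs):
    """Weighted average of Pr(w|c) over the words of one question."""
    total_weight = sum(weights.get(w, 1.0) for w in words)
    if total_weight == 0:
        return 0.0
    weighted = sum(weights.get(w, 1.0) * word_probs.get(w, 0.0) for w in words)
    return weighted / total_weight


def classify(words, weights, probs_by_category):
    """Pick the category with the highest combined score."""
    return max(
        probs_by_category,
        key=lambda c: question_score(words, weights, probs_by_category[c]),
    )


probs_by_category = {
    "c1": {"itch": 0.9, "medicine_x": 0.2},
    "c4": {"itch": 0.1, "medicine_x": 0.8},
}
question = ["itch", "medicine_x"]
print(classify(question, {}, probs_by_category))
print(classify(question, {"medicine_x": 3.0}, probs_by_category))
```

With all weights equal to one the symptom word dominates and the question is assigned to c1; raising the weight of the medicine entity moves it to c4, which is the effect the weight adjustment is meant to have.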


5  Experimental  Results 

5.1  Data  Set 

In the data preparation process, we collected nearly 1 million training items
and 200 thousand testing items to evaluate the proposed method. The data for
the four types of conversation is not evenly distributed, especially for the
fourth category, in which people ask about the negative influence of a
medicine. The details of the training corpus are shown in Table 2. For the
testing corpus, to keep it balanced, we built three testing data sets, each
containing 2000 items for each category and 8000 pairs in total. In the later
experiments, we use these three data sets for comparison to verify our model's
performance.
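One way to draw such balanced testing sets is sketched below. Sampling at random with a fixed seed and keeping the three sets disjoint are assumptions; the paper does not describe how the items were selected.

```python
import random


def balanced_test_sets(items_by_category, n_sets=3, per_category=2000, seed=0):
    """Draw n_sets disjoint sets with per_category items from every category."""
    rng = random.Random(seed)
    sets = [[] for _ in range(n_sets)]
    needed = n_sets * per_category
    for category, items in items_by_category.items():
        if len(items) < needed:
            raise ValueError(f"not enough items in category {category!r}")
        chosen = rng.sample(items, needed)
        for k in range(n_sets):
            part = chosen[k * per_category:(k + 1) * per_category]
            sets[k].extend((item, category) for item in part)
    return sets


# Small-scale usage: 4 categories, 3 sets, 2 items per category per set.
toy = {c: [f"{c}-{i}" for i in range(10)] for c in ("c1", "c2", "c3", "c4")}
toy_sets = balanced_test_sets(toy, n_sets=3, per_category=2)
print([len(s) for s in toy_sets])
```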
5.2  Evaluation  Measure

In our experiment, we compare our method with two basic methods, BOOL and
TF-IDF. All three methods rely on Equation (7) and differ only in the weight
values used in the equation. The BOOL method treats each phrase in the
sentence equally, so the weights of all phrases are set to one. The TF-IDF
method is very common; it uses the term frequency and inverse category
frequency of a word as its weight. As the evaluation metric we adopt accuracy,
which is commonly used in data mining and statistics. Accuracy is the
percentage of the testing data that is correctly classified.

5.3  Comparison  of  Methods 

To apply our method, we first need to assign values to the parameters in the
equations. In this paper, the initial contribution values [a, b, c, d] are
fixed at [0.75, 0.5, 0.25, 0], while the parameters α, β, γ vary in the range
[0, 1]. Later we adjust these variable parameters to make our model better
suited to intention perception in the question.

Because the knowledge base we have established is incomplete, a testing
question may contain no entity from the knowledge base, or the entities found
may have no connection to each other. In these situations, our model gives
each word a weight of one, just as the BOOL method does. To see how our method
works on testing data that involves only connected entities, we remove the
testing items with the above features and finally obtain nearly 11 thousand
testing pairs, which form the fourth testing corpus. Thus the four testing
corpora are as follows: the first three each contain 8 thousand items, and the
fourth contains only items that have connected entities.

Fig. 3 compares the three methods on the four testing corpora. In the
experiment, the parameters α, β, γ are 0.7, 0.9, 0.9 respectively, since these
achieved the best result after several tests. From the figure we find that the
TF-IDF method works no better than the BOOL method, which is reasonable
because TF-IDF is not effective for keyword extraction when the sentence is
short. Our model performs much better than these two methods, especially when
the data set contains only questions with connected entities, as shown in the
fourth histogram. This indicates that the proposed method can grasp the
central topic of a question and identify people's intention more accurately
than the other two methods.


Category    Type1    Type2    Type3    Type4    Total
Data Size   165422   454036   240580   53036    913074

Table 2. Training data corpus
Fig. 3. The comparison of three methods in the human's intention understanding

Fig. 4. The influence of the alpha parameter
Fig. 5. The influence of the beta parameter
Fig. 6. The influence of the gamma parameter


5.4  Parameters  Evaluation 

As mentioned before, the model adopted in this paper has a fixed set of values
for a, b, c, d, while α, β, γ are varied to optimize the intention perception
results. Figs. 4, 5 and 6 show how the three parameters influence the accuracy
of intention perception. In every figure, the other two parameters are fixed
to a static value of 0.7, so that the contribution multipliers between them
are the same. The results imply that medicine and disease entities tend to
receive a bigger contribution multiplier than symptom entities, which is
reasonable since they occur less frequently than symptom entities in a single
question. Thus the former two kinds of entities are more representative and
should get a bigger contribution multiplier.


5.5  Sentence  Length  Effect 

Most sentences in a conversation system are short, but the number of keywords
still fluctuates within a certain range. It is therefore meaningful to measure
how the three methods work when the number of words in the question falls in a
given interval. We divide the testing data by length in steps of 20 words.
From Fig. 7, we can see that our method performs better than the other two
when the number of words is neither too small nor too big. A short question
yields a limited number of entities, while a long one contains too much
information, which easily makes some words over-weighted. For very short
questions, the performance of all three methods is very low because it is
difficult to extract entities. When the sentence length is over 100, the
improvement from the knowledge base is not so significant.
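The length-based breakdown can be sketched as a grouped accuracy computation. The half-open bucket boundaries (0 to 19 words, 20 to 39 words, and so on) and the toy results are assumptions for illustration.

```python
from collections import defaultdict


def accuracy_by_length(results, step=20):
    """Accuracy per length bucket.

    results: iterable of (num_words, predicted_category, actual_category).
    Returns {bucket_start: accuracy}, where a bucket covers
    [bucket_start, bucket_start + step).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for num_words, predicted, actual in results:
        bucket = (num_words // step) * step
        total[bucket] += 1
        correct[bucket] += int(predicted == actual)
    return {b: correct[b] / total[b] for b in sorted(total)}


toy_results = [(5, "c1", "c1"), (19, "c2", "c1"), (20, "c3", "c3"), (45, "c4", "c4")]
print(accuracy_by_length(toy_results))
```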


Fig. 7. The comparison of three methods in questions of different length
(x-axis: the number of Chinese words in the question)


6  Conclusion 

In this paper, we crawled massive health conversation content to build a
health care knowledge base. After word segmentation, keywords were extracted
and symptom entities were selected using the feature candidate algorithm. The
health care knowledge base was built on the association relations between
disease, medicine and symptom entities. We proposed a simple graph path and
weight calculation algorithm that adjusts the association relations and
transmission weights to estimate the intention center words. We used a
Bayesian model to estimate the customer's intention within a short
conversation. Finally, we presented experimental results showing that
intention types are perceived effectively.

Since a real conversation system is like a game of catch, we will apply this
model to build an interactive dialogue system. Furthermore, we would introduce
living place, hospital name and age stage to enrich the knowledge base. The
method could also be extended to other content areas, such as travel
consultation and social networks.